GH-17211: [C++] Add hash_64 scalar compute function #39836

drin · 2024-01-29T17:08:17Z

This PR is a fresh continuation of #13487 which was closed purely in favor of this one. The reason for this is that the other PR carried a lot of "development burden" in the form of a long commit history. This fresh version maintains the current state of the code in much fewer commits with a cleaner history. Hopefully this PR will allow others to finish what I started (if I don't finish this first).

The changes still required is to add functional mechanisms for the following:

A TempVectorStack for local allocation within the kernel
- This is necessary to allocate a modified structure of the input data to then be passed to the hashing functions
Functional testing of a kernel that can handle a nested structure (e.g. a List[List[utf-8]]).
Closes: [C++][Compute] Add scalar_hash function #17211

judahrand · 2024-06-03T14:02:01Z

@drin are you likely to get some time to push this forward? #32504 is causing us issues and I'd be great to move towards solving it!

drin · 2024-06-03T14:13:23Z

sure! next week I'll have more time and I'll try to squeeze in a bit of time this week.

drin · 2024-06-19T20:50:09Z

@judahrand sorry, I ended up taking a break last week, but I happened to remember this today. I will work on this issue this week and next week and my goal will be to have it finished (or close to finished) by end of next week.

I'll try to stage changes so that I can validate on a single nested structure like what is described in #32504 (list<>) and once I have some tests for that working I will try to accommodate arbitrary nesting (list<list<...>>).

drin · 2024-06-28T00:09:39Z

rebased on main branch

drin · 2024-07-03T22:05:50Z

@judahrand, I wanted to verify something with you since I assume you have an immediate use case in mind for #32504 .

The naive implementation of hashing for a "shallow" nested type (e.g. list<int8>) is relatively straightforward and the code available in this PR should be able to handle that already.

What held up this PR was logic for "deep" nested types (e.g. list<list<int8>>). To move towards addressing this, I am considering the following simple scenario and hoping to get your feedback that it is addressed reasonably (this is why it would be helpful if you have use cases in mind):

# A "shallow" nested type that we should be able to reproduce (I think)
Reference Test data [length: 3]:
[
  [ 1, 1, 2, 3,  4 ],
  [ 2, 2, 4, 6,  8 ],
  [ 3, 3, 6, 9, 12 ]
]

# A "deep" nested type that we may receive as input
Sample Test data (list<item: list<item: int64>>| [length: 3]):
[
  [ [ 1 ], [ 1, 2, 3,  4 ] ],
  [ [ 2 ], [ 2, 4, 6,  8 ] ],
  [ [ 3 ], [ 3, 6, 9, 12 ] ]
]

There is ambiguity in how a "deep" nested type should be hashed (maybe the nested structure should produce a different hash for a "row" of values), but what makes the most sense to me is for it to be value-based (the above example). In the case that a different approach is preferred, other approaches can either be encoded as future options or a UDF provided instead of the existing hashing functions.

I have prototyped a naive version of the flattening logic in a separate repo: recipe_convert.cpp#L150-L179. For context, this type of flattening would be used in the FastHashScalar::Exec function as a preprocessing step before calling the Hashing32::HashMultiColumn function.

Thanks for any feedback you can provide! Particularly on expected behavior for nested array types.

drin · 2024-07-03T22:09:00Z

An alternate approach is to return a

Status::NotImplemented("Cannot hash nested data types with many levels of nesting")

if any of the columns to hash are "deep" nested types.

judahrand · 2024-07-04T08:00:51Z

There is ambiguity in how a "deep" nested type should be hashed (maybe the nested structure should produce a different hash for a "row" of values), but what makes the most sense to me is for it to be value-based (the above example).

I'm not entirely sure I've understood your distinction between the two cases. Am I correct in understanding that in the first of the two options that you describe a row containing[[1], [1,2,3,4]] would have a different hash value to [[1,1], [2,3,4]] but in the second, 'value-based' option these two lists would have the same hash?

If that is the case then the 'value-based' behaviour would be surprising to me. I'd have thought that two rows with what is definitely different data should have different hashes (not considering the small chance of a hash collision).

For the case of nested listed I wonder if this issue could be avoided by including the offsets of the parent list in the hash?

drin · 2024-07-05T05:39:38Z

I'm not entirely sure I've understood your distinction between the two cases. Am I correct in understanding that in the first of the two options that you describe a row containing[[1], [1,2,3,4]] would have a different hash value to [[1,1], [2,3,4]] but in the second, 'value-based' option these two lists would have the same hash?

Oh yeah, I guess I didn't make that part clear. But yes, you understood correctly. I was in favor of [[1], [1, 2, 3, 4]] having the same hash as [[1, 1], [2, 3, 4]] (what I called the "value-based" option).

If you want those to have different hashes, then it may be possible to transform the list array into a struct array and one struct column can be all of the offsets coalesced and a second struct column can be all of the values (referenced). To do it any other way would require extending the existing hashing functions in non-trivial ways or adding new hashing functions entirely.

So, maybe for this PR I can simply do shallow nested arrays and someone can work on the deep nested arrays in aa follow-up PR.

This commit ports the latest state of scalar_hash kernels without pulling a long development history. This kernel is an element-wise function that uses an xxHash-like algorithm, prioritizing speed and not suitable for cryptographic purposes. The function is implemented by the `FastHashScalar` struct which is templated by the output type (which is assumed to be either UInt32 or UInt64, but there is no validation of that at the moment). The benchmarks in scalar_hash_benchmark.cc uses the hashing_benchmark.cc file as a reference (in cpp/src/arrow/util/), but only covers various input types and the key hashing functions (from key_hash.h). The tests in scalar_hash_test.cc use a simplified version of hashing based on what is implemented in the key_hash.cc. The idea being that the high-level entry points for high-level types should eventually reach an expected application of the low-level hash functions on simple data types; the tests do this exact comparison. At the moment, the tests pass for simple cases, but they do not work for nested types with non-trivial row layouts (e.g. ListTypes). Issue: ARROW-8991 Issue: apacheGH-17211

This commit pulls the latest changes to key_hash.h and implementations in light_array without the burden of a long development history. The only change in key_hash.h is the addition of a friend function which is used in scalar_hash_test.cc. Changes in light_array.[h,cc] are to accommodate two scenarios: (1) the use of ArraySpan, which was introduced after light_array was written; and (2) the need for a KeyColumnArray to allocate data for the purposes of interpreting (or decoding) the structure of a nested type. The main reason for the 2nd scenario is that a ListArray may have many values represented in a single row which should be hashed together; however, if the ListArray has a nested ListArray or other type, the row may have further structure. In the simplest interpretation, only the highest-level structure (the "outer" ListArray) needs to be preserved, and any further nested structures must be explicitly handled by custom kernels (or any future options, etc. that are upstreamed into Arrow). In trying to efficiently interpret complex nested types, ArraySpan can be useful because it is non-owning, thus the main reason for the 1st aforementioned scenario. Although unfinished, any tests added to light_array_test.cc should accommodate the 2 scenarios above.

This commit includes changes to register a new compute function without the burden of a long development history. The change to cpp/src/arrow/CMakeLists.txt includes scalar_hash.cc in compilation as it is used by the new Hash64 function defined in api_scalar.[h,cc]. The change to cpp/src/arrow/compute/kernels/CMakeLists.txt includes scalar_hash_test.cc in compilation for tests and it also adds a new benchmark binary that is implemented by scalar_hash_benchmark.cc. The registry files are updated to register the kernel implementations in scalar_hash.cc with the function definitions in api_scalar.[h,cc]. Finally, docs/source/cpp/compute.rst adds documentation for the Hash64 function. Issue: apacheGH-17211 Issue: ARROW-8991

This commit includes additions to the general hashing benchmarks that cover the use of hashing functions in key_hash.h without carrying the burden of a long dev history. Some existing benchmark names were changed to distinguish between the use of Int32 and Int64 types, new benchmarks were added that use the functions declared in key_hash.h. The reason the new benchmarks are added is because it is claimed they prioritize speed over cryptography as they're primarily used for join algorithms and other processing tasks, which the hashing benchmark can now provide observability for. Issue: apacheGH-17211 Issue: ARROW-8991

This is primarily to make the next commit a bit more focused.

Previous commits tried to add allocation mechanisms to KeyColumnArray to accommodate arbitrarily nested types. While some allocation is necessary to accommodate arbitrarily nested types, the work is being split into incremental steps and scope is being reduced to the straightforward nested types. Specifically, we want to support: ListArray, StructArray, and MapArray. We **only** support these types if they contain primitive types (e.g. List<int8> not List<List<int8>>, or Map<int64, float32> not Map<string, int32>). The difficulty with nested types beyond the first level is that multiple buffers of offsets must be accommodated and that support requires changes that either: (1) will propagate through key_hash code or (2) requires a (temporary) allocation of an array that flattens offsets into a single offset buffer.

Needed to add a check when constructing a KeyColumnArray for a ListArray that the element type of the ListArray is primitive.

github-actions bot added Component: C++ Component: Documentation awaiting review Awaiting review labels Jan 29, 2024

drin changed the title ~~[C++] Add hash_64 scalar compute function~~ GH-17211: [C++] Add hash_64 scalar compute function Jan 29, 2024

apache deleted a comment from github-actions bot Jan 29, 2024

pitrou mentioned this pull request Mar 29, 2024

GH-40431: [C++] Move key_hash/key_map/light_array related files to internal for prevent using by users #40484

Merged

drin force-pushed the ARROW-8991-newfn-scalar-hash-fresh branch from 255d0be to 23558bf Compare June 28, 2024 00:09

drin added 5 commits October 7, 2024 13:28

added KeyColumnArray alias

b2af8c9

drin force-pushed the ARROW-8991-newfn-scalar-hash-fresh branch from 23558bf to b2af8c9 Compare October 7, 2024 22:18

drin added 3 commits October 11, 2024 05:46

task: revert style and other minor changes

304667b

This is primarily to make the next commit a bit more focused.

minor: added type check for ListArray

50c8b5c

Needed to add a check when constructing a KeyColumnArray for a ListArray that the element type of the ListArray is primitive.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-17211: [C++] Add hash_64 scalar compute function #39836

GH-17211: [C++] Add hash_64 scalar compute function #39836

drin commented Jan 29, 2024 •

edited by github-actions bot

Loading

judahrand commented Jun 3, 2024

drin commented Jun 3, 2024

drin commented Jun 19, 2024 •

edited

Loading

drin commented Jun 28, 2024

drin commented Jul 3, 2024

drin commented Jul 3, 2024

judahrand commented Jul 4, 2024

drin commented Jul 5, 2024

GH-17211: [C++] Add hash_64 scalar compute function #39836

Are you sure you want to change the base?

GH-17211: [C++] Add hash_64 scalar compute function #39836

Conversation

drin commented Jan 29, 2024 • edited by github-actions bot Loading

judahrand commented Jun 3, 2024

drin commented Jun 3, 2024

drin commented Jun 19, 2024 • edited Loading

drin commented Jun 28, 2024

drin commented Jul 3, 2024

drin commented Jul 3, 2024

judahrand commented Jul 4, 2024

drin commented Jul 5, 2024

drin commented Jan 29, 2024 •

edited by github-actions bot

Loading

drin commented Jun 19, 2024 •

edited

Loading